    Performance Evaluation of Parallel Sparse Matrix–Vector Products on SGI Altix3700

    Abstract. The present paper discusses scalable implementations of sparse matrix-vector products, which are crucial for high-performance solutions of large-scale linear equations, on the SGI Altix3700, a cc-NUMA machine. Three storage formats for sparse matrices are evaluated, and scalability is attained by implementations that take the page allocation mechanism of the NUMA machine into account. The influence of the cache and memory bus architectures on the optimum choice of storage format is examined, and scalable converters between storage formats are shown to facilitate the use of higher-performance formats.
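
    For readers unfamiliar with the operation, the following is a minimal sketch of a sparse matrix-vector product. The abstract does not name the three storage formats it evaluates, so the use of compressed sparse row (CSR) here is an assumption made for illustration, as are the names `csr_matvec`, `values`, `col_idx`, and `row_ptr`. The sketch is serial and does not model the NUMA page placement the paper studies.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row i owns entries row_ptr[i] .. row_ptr[i + 1] - 1
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Only the stored nonzeros of row i are visited.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Illustrative 3x3 matrix:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

    Each row's result is independent of the others, which is why the outer loop is the natural place to parallelize; on a NUMA machine the placement of `values`, `col_idx`, and `x` in memory then becomes the performance question.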

    Linpack evaluation on a supercomputer with heterogeneous accelerators

    Abstract. We report Linpack benchmark results on the TSUBAME supercomputer, a large-scale heterogeneous system equipped with NVIDIA Tesla GPUs and ClearSpeed SIMD accelerators. With all of 10,480 Opteron cores, 640 Xeon cores, 648 ClearSpeed accelerators and 624 NVIDIA Tesla GPUs, we have achieved 87.01 TFlops, which is the third record as a heterogeneous system in the world. This paper describes the careful tuning and load balancing methods required to achieve this performance. On the other hand, since the peak speed is 163 TFlops, the efficiency is 53%, which is lower than that of other systems. This paper also analyses this gap from the aspect of system architecture.

    Efficient high-precision integer multiplication on the GPU

    Dieguez AP, Amor M, Doallo R, Nukada A, Matsuoka S. Efficient high precision integer multiplication on the GPU. The International Journal of High Performance Computing Applications. 2022;36(3):356-369. © The Author(s) 2022. Publisher: SAGE Publications. https://doi.org/10.1177/10943420221077964

    Abstract: The multiplication of large integers, which has many applications in computer science, is an operation that can be expressed as a polynomial multiplication followed by a carry normalization. This work develops two approaches for efficient polynomial multiplication. The first is based on tiling the classical convolution algorithm while taking advantage of new CUDA architectures, a novel approach that computes the multiplication using integers without loss of accuracy. The second is based on the Strassen algorithm, an algorithm that multiplies large polynomials using the FFT operation, adapting the fastest FFT libraries for current GPUs and working on the complex field. Previous studies reported that the Strassen algorithm is an effective implementation for “large enough” integers on GPUs. Additionally, most previous studies do not examine the implementation of the carry normalization, but this work describes a parallel implementation for this operation. Our results show the efficiency of our approaches for short, medium, and large sizes.

    The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work has been supported by the Ministry of Science and Innovation of Spain (PID2019-104184RB-I00), by the Galician Government and FEDER funds under the Consolidation Program of Competitive Reference Groups (UDC/GI-000265, ref. ED431C 2021/30), by the Consolidation Program of Competitive Research Units (ED431G2019/01), and by the FPU Program of the Ministry of Education of Spain (FPU14/02801). It is also partially supported by JST CREST [JPMJCR1303 and JPMJCR1687] and NVIDIA GPU Center of Excellence and conducted as research activities of AIST-TokyoTech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL). Xunta de Galicia; ED431C 2021/3
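
    The decomposition stated in the abstract, a polynomial multiplication followed by a carry normalization, can be sketched as follows. This is a serial illustration under stated assumptions, not the authors' GPU implementation: the limb base of 10^4, the little-endian limb order, and all function names (`to_limbs`, `convolve`, `carry_normalize`, `from_limbs`, `multiply`) are hypothetical choices, and the classical convolution is written without tiling or FFT.

```python
def to_limbs(n, base):
    """Split a non-negative integer into little-endian limbs."""
    limbs = []
    while n:
        n, r = divmod(n, base)
        limbs.append(r)
    return limbs or [0]


def convolve(a, b):
    """Classical O(n*m) polynomial multiplication of coefficient lists."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c


def carry_normalize(c, base):
    """Propagate carries so that every limb lies in [0, base)."""
    out = []
    carry = 0
    for coeff in c:
        carry, digit = divmod(coeff + carry, base)
        out.append(digit)
    while carry:
        carry, digit = divmod(carry, base)
        out.append(digit)
    return out


def from_limbs(limbs, base):
    """Rebuild the integer from little-endian limbs."""
    n = 0
    for limb in reversed(limbs):
        n = n * base + limb
    return n


def multiply(x, y, base=10**4):
    # Step 1: treat the limbs as polynomial coefficients and convolve.
    product = convolve(to_limbs(x, base), to_limbs(y, base))
    # Step 2: carry normalization turns coefficients back into limbs.
    return from_limbs(carry_normalize(product, base), base)
```

    The convolution step has no dependencies between output coefficients, whereas the carry loop above is sequential; that dependency chain is the part the abstract says most earlier studies leave unexamined.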

    Modeling gather and scatter with hardware performance counters for Xeon Phi


    Hamming Color Code for Dense and Robust One-shot 3D Scanning (Yamazaki, Nukada, Mochimaru)

    We propose a novel color code, the Hamming color code, designed for rapid 3D shape acquisition using structured light projection. The Hamming color code has several properties that are desirable for practical 3D acquisition. First, the Hamming distance between adjacent colors is always 1, which makes color detection robust to color blending due to defocusing, subsurface scattering, or chromatic aberration. Second, every substring of a certain length is guaranteed to be unique. In other words, the Hamming color code can be viewed as a subset of a de Bruijn sequence. Third, a one-dimensional coordinate can be encoded for each pixel, which enables dense 3D reconstruction from a single pattern projection. Thanks to the uniqueness and robustness of the substrings, the structured light can be decoded stably by dynamic programming. We have implemented parallel dynamic programming on a GPU, achieving a speed-up by a factor of 630 over the CPU-based implementation and video-rate 3D acquisition using commodity hardware. Several experiments have been conducted to demonstrate the stability and performance of our algorithm. Finally, we discuss the limitations and future directions of this work.
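
    The first two properties listed in the abstract can be checked mechanically. The sketch below assumes each color is a 3-bit integer with one bit per RGB channel, which the abstract does not state; the function names and the sample sequence are likewise hypothetical, and the sequence is a plain Gray code chosen for illustration, not the authors' pattern.

```python
def hamming_distance(a, b):
    """Number of bit positions (color channels) in which a and b differ."""
    return bin(a ^ b).count("1")


def adjacent_distance_is_one(seq):
    """Property 1: neighboring colors differ in exactly one channel."""
    return all(hamming_distance(a, b) == 1 for a, b in zip(seq, seq[1:]))


def windows_are_unique(seq, k):
    """Property 2: every length-k substring occurs at most once."""
    windows = [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    return len(windows) == len(set(windows))


# Illustrative sequence: the 3-bit reflected Gray code (assumed encoding).
sample = [0, 1, 3, 2, 6, 7, 5, 4]
ok_adjacent = adjacent_distance_is_one(sample)
ok_unique = windows_are_unique(sample, 2)
```

    When both checks hold, any observed window of `k` consecutive colors identifies its position in the pattern, which is the basis for encoding a one-dimensional coordinate per pixel.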